An empirical study of the effect of outliers on the misclassification error rate
Abstract
An outlier is an observation that deviates so much from other observations that it seems to have been generated by a different mechanism. Outlier detection has many applications, such as data cleaning, fraud detection and network intrusion detection. Outliers can indicate individuals or groups whose behavior is very different from that of most individuals in the data set. Outliers are frequently removed to improve the accuracy of estimators, but sometimes the presence of an outlier has a specific meaning, and that explanation is lost if the outlier is deleted. In this paper we study the effect of outliers on the performance of three well-known classifiers, based on results observed on four real-world datasets. We use four outlier detection approaches: detection based on robust statistical estimators of the center and the covariance matrix for the Mahalanobis distance; detection based on clustering with the partitioning around medoids (PAM) algorithm; and two data mining techniques, Bay’s algorithm for distance-based outliers and LOF, a density-based local outlier algorithm.
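As an illustration of the first approach named in the abstract, the sketch below flags outliers by their Mahalanobis distance from robust estimates of the center and covariance. The abstract does not say which robust estimators the paper uses, so this is only a stand-in under stated assumptions: the function names, the crude trimming scheme (keep the points closest to the coordinate-wise median), the `keep` fraction, the 0.975 chi-square cutoff and the sample data are all hypothetical, and the code is limited to two-dimensional data to stay dependency-free.

```python
import math
from statistics import median

def trimmed_center_and_cov(points, keep=0.75):
    """Crude robust estimates for 2-D data (an assumption, not the paper's estimator).

    Keep the `keep` fraction of points closest to the coordinate-wise median,
    then return their mean and sample covariance as (center, (sxx, sxy, syy)).
    """
    mx = median(p[0] for p in points)
    my = median(p[1] for p in points)
    ranked = sorted(points, key=lambda p: (p[0] - mx) ** 2 + (p[1] - my) ** 2)
    core = ranked[:max(3, int(keep * len(points)))]
    n = len(core)
    cx = sum(p[0] for p in core) / n
    cy = sum(p[1] for p in core) / n
    sxx = sum((p[0] - cx) ** 2 for p in core) / (n - 1)
    syy = sum((p[1] - cy) ** 2 for p in core) / (n - 1)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in core) / (n - 1)
    return (cx, cy), (sxx, sxy, syy)

def mahalanobis_sq(point, center, cov):
    """Squared Mahalanobis distance in 2-D, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - center[0], point[1] - center[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def flag_outliers(points, quantile=0.975, keep=0.75):
    """Return indices of points whose squared distance exceeds the cutoff.

    For 2 degrees of freedom the chi-square quantile has the closed form
    -2 * ln(1 - quantile), so no statistics library is needed.
    """
    cutoff = -2.0 * math.log(1.0 - quantile)
    center, cov = trimmed_center_and_cov(points, keep)
    return [i for i, p in enumerate(points)
            if mahalanobis_sq(p, center, cov) > cutoff]

# Hypothetical sample: eight points in one cluster and one distant point.
points = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5), (2.0, 2.5), (1.0, 1.0),
          (2.5, 2.0), (1.8, 1.2), (1.2, 1.9), (10.0, 10.0)]
print(flag_outliers(points))
```

Because the trimmed covariance is computed from the central points only, it tends to underestimate the true spread, so this sketch flags more aggressively than a properly calibrated robust estimator would.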
Similar resources
Detection of Outliers and Influential Observations in Linear Ridge Measurement Error Models with Stochastic Linear Restrictions
The aim of this paper is to propose some diagnostic methods in linear ridge measurement error models with stochastic linear restrictions using the corrected likelihood. Based on the bias-corrected estimation of model parameters, diagnostic measures are developed to identify outlying and influential observations. In addition, we derive the corrected score test statistic for outliers detection ba...
Analysis of angina pectoris status based on misclassification probabilities of the smoking risk factor in the Tehran Lipid and Glucose Study, 1378-79
Misclassification of disease status and risk factors is one of the main sources of error in studies. Wrong assignment of individuals into exposed and non-exposed groups may seriously distort the results in case-control studies. This study investigates the effect of misclassification error on odds ratio estimates and attempts to introduce a correction method. Data on 3332 men aged 30-69 years fr...
A New Formulation for Cost-Sensitive Two Group Support Vector Machine with Multiple Error Rate
Support vector machine (SVM) is a popular classification technique that classifies data using a max-margin separating hyperplane. The normal vector and bias of this hyperplane are determined by solving a quadratic model, which means that SVM training involves an optimization problem. Among the extensions of SVM, the cost-sensitive scheme refers to a model with multiple costs which conside...
Identification of outliers types in multivariate time series using genetic algorithm
Multivariate time series data are often modeled using the vector autoregressive moving average (VARMA) model. However, the presence of outliers can violate the stationarity assumption and may lead to wrong modeling, biased estimation of parameters and inaccurate prediction. Thus, detection of these points and how to deal properly with them, especially in relation to modeling and parameter estimation of VARMA m...
Separating Well Log Data to Train Support Vector Machines for Lithology Prediction in a Heterogeneous Carbonate Reservoir
The prediction of lithology is necessary in all areas of petroleum engineering. This means that to design a project in any branch of petroleum engineering, the lithology must be well known. Support vector machines (SVMs) use an analytical approach to classification based on statistical learning theory, the principles of structural risk minimization, and empirical risk minimization. In this res...